Leveraging SelectivePCA's weight=True capability

Some algorithms intrinsically treat every feature as equally important. For many such algorithms, e.g., clustering algorithms, this assumption is misleading and can produce poor results. The following notebook demonstrates skutil's weighting capability via SelectivePCA.


In [1]:
from __future__ import print_function
import numpy as np
import pandas as pd
import sklearn
from sklearn.datasets import load_iris
sklearn.__version__


Out[1]:
'0.17.1'

Preparing the data for modeling


In [2]:
iris = load_iris()
X, y = iris.data, iris.target # this is unsupervised; we aren't going to split

Basic k-Means, no weighting:

Here, we'll run a basic k-Means (k=3) preceded by a default SelectivePCA (no weighting)


In [5]:
from sklearn.metrics import accuracy_score
from skutil.decomposition import SelectivePCA
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

# define our default pipe: PCA keeping enough components to
# explain 99% of the variance, then k-means with k=3
pca = SelectivePCA(n_components=0.99)
pipe = Pipeline([
        ('pca',   pca),
        ('model', KMeans(3))  # note: k-means initialization is random, so scores may vary slightly
    ])

# fit the pipe
pipe.fit(X, y)

# predict and score
print('Train accuracy: %.5f' % accuracy_score(y, pipe.predict(X)))


Train accuracy: 0.89333

This is a decent accuracy, but not a stellar one... Surely we can improve it, right? (Note that scoring raw cluster labels with accuracy_score only works here because KMeans happens to assign labels that line up with the class encoding; cluster labels are arbitrary in general.) Part of the problem is that clustering relies on distance metrics, which treat all the features equally. Since PCA orders its components by the variance they explain, we can weight each component according to the variability it explains: the most informative components are up-weighted, and the least informative are down-weighted.
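To see why equal treatment is a problem, here is a tiny illustration (hypothetical two-feature points, not part of the iris example): Euclidean distance, the metric underlying k-means, counts a difference along an uninformative feature exactly as much as the same difference along an informative one.


In [ ]:
# hypothetical two-feature points: feature 0 is informative, feature 1 is noise
signal_diff = np.array([1., 0.])  # points differing only on the informative feature
noise_diff  = np.array([0., 1.])  # points differing only on the noise feature

# Euclidean distance cannot tell these two situations apart
print(np.linalg.norm(signal_diff), np.linalg.norm(noise_diff))  # both 1.0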

Here is the explained_variance_ratio_ vector:


In [6]:
pca.pca_.explained_variance_ratio_


Out[6]:
array([ 0.92461621,  0.05301557,  0.01718514])

And here's what our weighting vector will ultimately look like:


In [7]:
weights = pca.pca_.explained_variance_ratio_.copy()  # copy so we don't mutate the fitted PCA in place
weights -= np.median(weights)  # center the ratios around their median
weights += 1                   # shift so the median-variance component gets weight 1.0
weights


Out[7]:
array([ 1.87160064,  1.        ,  0.96416957])
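To make the effect concrete, here is a minimal sketch of what the weighting amounts to (an illustration consistent with the weight vector above, not skutil's actual source): each principal component is scaled by its weight before the data reaches the clusterer, so distances become dominated by the high-variance components.


In [ ]:
# illustration only -- assumes weight=True boils down to a per-component rescaling
X_pca = pca.pca_.transform(X)  # project the data onto the fitted components
X_weighted = X_pca * weights   # up-weight high-variance components, down-weight the rest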

k-Means with weighting:


In [10]:
# define our weighted pipe
pca = SelectivePCA(n_components=0.99, weight=True)
pipe = Pipeline([
        ('pca',   pca),
        ('model', KMeans(3))
    ])

# fit the pipe
pipe.fit(X, y)

# predict and score
print('Train accuracy (with weighting): %.5f' % accuracy_score(y, pipe.predict(X)))


Train accuracy (with weighting): 0.90667

Note that this is not limited to KMeans, or even to clustering tasks. Any algorithm that does not intrinsically perform regularization or some other form of feature selection can fall into this trap, and SelectivePCA's weighting can help!
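As a quick sketch of that point (not run above, and using adjusted Rand index instead of accuracy, since it is invariant to the arbitrary numbering of cluster labels), the same weighted projection can feed any other distance-based learner:


In [ ]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# reuse the weighted SelectivePCA fitted above (sklearn Pipelines fit
# their steps in place, so pca is already fitted)
X_w = pca.transform(X)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_w)

# adjusted Rand index ignores how the cluster labels are numbered
print('Adjusted Rand (weighted PCA + agglomerative): %.5f' % adjusted_rand_score(y, labels))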